Football is a very exciting sport. To this day, it is the most popular game on the planet. Sorry, not sorry about other games.

I want to review data collected since 1872 to understand how matches between countries have evolved up to now. So we turn to R and a few libraries to help us visualize the data:

library(tidyverse)
library(plotly)
library(lubridate)

Dataset

The first step is to read the file. I downloaded this dataset from Kaggle on 2021-07-22.

results <- read.csv("results.csv", encoding = "UTF-8")

This dataset contains more than 42,000 football matches from the history of international encounters between national teams. So, let’s take a quick look at the data:

head(results)
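
Since `read.csv()` reads the `date` column as text, it can help to check the column types and convert it before grouping by year later on. This is a minimal sketch on an invented two-row data frame; it assumes the dates are written as YYYY-MM-DD, which is an assumption about the file rather than something checked here:

```r
library(lubridate)

# Toy stand-in for `results`; the rows are invented for illustration only.
toy <- data.frame(
  date = c("1872-11-30", "1930-07-13"),
  home_team = c("Team A", "Team C"),
  away_team = c("Team B", "Team D"),
  stringsAsFactors = FALSE
)

str(toy)                      # `date` is still a character column here
toy$date <- ymd(toy$date)     # parse "YYYY-MM-DD" text into Date values
match_year <- year(toy$date)  # numeric year, ready for group_by()
```

The same `ymd()` call could be applied to `results$date` if the real file follows that format.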

Analysis

It is interesting to look at the context of the matches. Some of them may not be relevant at all, but there are also World Cup matches, continental tournaments, and so on:

levels(as.factor(results$tournament)) -> tournaments
sample(tournaments,20)
##  [1] "Simba Tournament"                          
##  [2] "Island Games"                              
##  [3] "Pan American Championship"                 
##  [4] "Atlantic Heritage Cup"                     
##  [5] "United Arab Emirates Friendship Tournament"
##  [6] "CFU Caribbean Cup qualification"           
##  [7] "Inter Games Football Tournament"           
##  [8] "CONCACAF Nations League"                   
##  [9] "Intercontinental Cup"                      
## [10] "UEFA Euro"                                 
## [11] "CONCACAF Championship qualification"       
## [12] "Dunhill Cup"                               
## [13] "Nile Basin Tournament"                     
## [14] "Brazil Independence Cup"                   
## [15] "Copa América"                              
## [16] "Copa Félix Bogado"                         
## [17] "EAFF Championship"                         
## [18] "Copa Roca"                                 
## [19] "Atlantic Cup"                              
## [20] "COSAFA Cup"

Keeping only the tournaments with more than 100 matches played in their history:

results %>%
  group_by(tournament) %>%
  summarise(count=n()) %>%
  filter(count > 100) %>%
  select(tournament) -> popularCups

results %>%
  filter(tournament %in% popularCups$tournament) %>%
  ggplot(aes(x=tournament, fill=tournament)) +
  geom_bar() +
  coord_flip() +
  labs(title="Matches in tournaments") -> p 
ggplotly(p)
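
The counting step can also be written with `count()`, which combines `group_by()` and `summarise(n())`. This is only a sketch on invented data; `toy`, `min_matches` and `popular` are hypothetical names, and the threshold is lowered to 2 so the tiny example has something to filter:

```r
library(dplyr)

# Invented toy data: 3 matches in "Cup X", 1 match in "Cup Y".
toy <- data.frame(
  tournament = c("Cup X", "Cup X", "Cup X", "Cup Y"),
  stringsAsFactors = FALSE
)

min_matches <- 2  # hypothetical threshold; the post uses 100

popular <- toy %>%
  count(tournament, name = "matches") %>%  # one row per tournament
  filter(matches > min_matches) %>%        # strictly more than the threshold
  pull(tournament)                         # plain character vector

# Keep only the matches played in the popular tournaments.
toy_popular <- toy %>% filter(tournament %in% popular)
```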

Now we need to process the data a little to assign points in a standard way, based on the outcome of every match:

Points  Outcome
3       Victory
1       Tie
0       Defeat

In FIFA’s scoring, a team can earn 2 points by winning a shootout after a tied match; however, I ignored that rule in the following analysis.

Let’s see how the data looks now:

results %>%
  # A match is tied when both teams scored the same number of goals
  mutate(tied=home_score == away_score) %>%
  # 1 point each for a tie, otherwise 3 for the winner and 0 for the loser
  mutate(home_points=ifelse(tied,1,ifelse(home_score > away_score,3,0))) %>%
  mutate(away_points=ifelse(tied,1,ifelse(home_score > away_score,0,3))) -> results

results %>%
  filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
head(worldCupResults)
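
The nested `ifelse()` calls above can also be written with `case_when()`, which some readers find easier to scan. This is a sketch on invented scores, not a replacement for the code above; note that the `TRUE ~ 0` fallback would give 0 points to rows with missing scores, whereas `ifelse()` would return `NA` for them:

```r
library(dplyr)

# Invented scores covering the three possible outcomes.
toy <- data.frame(home_score = c(2, 1, 0), away_score = c(0, 1, 3))

toy <- toy %>%
  mutate(
    home_points = case_when(
      home_score > away_score  ~ 3,  # home victory
      home_score == away_score ~ 1,  # tie
      TRUE                     ~ 0   # home defeat (or missing scores)
    ),
    away_points = case_when(
      away_score > home_score  ~ 3,  # away victory
      away_score == home_score ~ 1,  # tie
      TRUE                     ~ 0   # away defeat (or missing scores)
    )
  )
```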

After this step, we also need to reshape the dataset so that each match contributes one row per team, which lets us measure the performance of each national team. Then we keep the matches from tournaments that contain "FIFA World Cup" in their name:

results %>%
  pivot_longer(c(home_team,away_team),names_to = "homeaway", values_to = "team") %>%
  mutate(points=ifelse(grepl("home",homeaway),home_points,away_points),
         goals=ifelse(grepl("home",homeaway),home_score,away_score),
         receivedGoals=ifelse(grepl("home",homeaway),away_score,home_score)) %>%
  select(date,tournament,country,team,points,goals,receivedGoals) -> results

results %>%
  filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
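
To make the reshaping concrete, here is a small sketch on a single invented match. It shows one wide row turning into two long rows, one per team; `toy`, `long` and `is_home` are hypothetical names used only for this illustration:

```r
library(dplyr)
library(tidyr)

# One invented match: Team A 2-1 Team B, with points already assigned.
toy <- data.frame(
  home_team = "Team A", away_team = "Team B",
  home_score = 2, away_score = 1,
  home_points = 3, away_points = 0,
  stringsAsFactors = FALSE
)

long <- toy %>%
  # Stack the two team columns: each match now yields two rows.
  pivot_longer(c(home_team, away_team),
               names_to = "homeaway", values_to = "team") %>%
  mutate(
    is_home = homeaway == "home_team",
    # Pick the home or away value depending on which side the row describes.
    points = ifelse(is_home, home_points, away_points),
    goals = ifelse(is_home, home_score, away_score),
    receivedGoals = ifelse(is_home, away_score, home_score)
  ) %>%
  select(team, points, goals, receivedGoals)
```

Here `long` has one row for Team A (3 points, 2 goals scored, 1 received) and one for Team B (0 points, 1 goal scored, 2 received).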

FIFA World Cup (and qualifiers)

The most interesting matches take place at the FIFA World Cup, so we can focus on what happens in this tournament:

# Keep every World Cup match here, qualifiers included
worldCupResults %>%
  mutate(yr=year(date)) %>%
  group_by(yr,team) %>%
  summarise(p=sum(points),
            goals=sum(goals),
            against=sum(receivedGoals),
            matches=n()) %>%
  # Per-match averages for each team and year
  mutate(performance=p/matches,
         ofensive=goals/matches,
         defense=against/matches) %>%
  ggplot(aes(x=yr, y=performance, fill=team)) +
  geom_bar(stat="identity") -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)

Germany emerges as the best performer across all the matches related to the World Cup. That is no surprise at all: remember all the “goleadas” (heavy wins) it has produced, in the qualifiers as well as in the knockout matches of the tournament’s final stages.
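
A claim like this can also be checked with a plain ranking table instead of reading it off the chart. The sketch below uses invented long-format rows and a hypothetical `min_played` cutoff, so it only illustrates the idea of averaging points per match over all years:

```r
library(dplyr)

# Invented long-format rows (one row per team per match).
toy <- data.frame(
  team = c("Team A", "Team A", "Team B", "Team B", "Team C"),
  points = c(3, 1, 0, 3, 3),
  stringsAsFactors = FALSE
)

min_played <- 2  # hypothetical cutoff so a single lucky match cannot top the table

ranking <- toy %>%
  group_by(team) %>%
  summarise(matches = n(),
            performance = sum(points) / matches,  # average points per match
            .groups = "drop") %>%
  filter(matches >= min_played) %>%
  arrange(desc(performance))
```

Applied to `worldCupResults`, the same pipeline would list teams by average points per match across every year in the data.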

Now let’s see what happens when we focus only on the final stage, that is, after filtering out the qualifiers:

worldCupResults %>%
  filter(!grepl("qualifi",tournament)) %>%
  mutate(yr=year(date)) %>%
  group_by(yr,team) %>%
  summarise(p=sum(points),
            goals=sum(goals),
            against=sum(receivedGoals),
            matches=n()) %>%
  mutate(performance=p/matches,
         ofensive=goals/matches,
         defense=against/matches) %>%
  filter(team %in% c("Mexico","Brazil","Argentina","Germany","France")) %>%
  ggplot(aes(x=yr, y=performance, color=team)) +
  geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
worldCupResults %>%
  filter(!grepl("qualifi",tournament)) %>%
  mutate(yr=year(date)) %>%
  group_by(yr,team) %>%
  summarise(p=sum(points),
            goals=sum(goals),
            against=sum(receivedGoals),
            matches=n()) %>%
  mutate(performance=p/matches,
         ofensive=goals/matches,
         defense=against/matches) %>%
  mutate(differenceGoal=ofensive-defense) %>%
  ggplot(aes(x=yr, color=team, y=differenceGoal)) +
  geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)